The objectives of this assignment are to build a command line retrieval engine on top of the inverted files created in Assignment 3. The retrieval engine should take in queries which are lists of words, run them through the same preprocessing as the collection (i.e., downcase, optionally remove stopwords) and then match the query words against the inverted file to come up with document weights (the sum of the term weights in the document). Display the ten top-ranking document identifiers or filenames to the user.
You may use any algorithm you choose to locate the postings, but your report should discuss your choices (i.e., do binary search on index file on disk versus load entire index into memory and search there). Use some appropriate data structure to avoid keeping the whole TDM in memory as a fixed array of floats.
Entries in the TDM shall be non-negative.
The query must run in a single command from the linux command
line, i.e., not be an interactive menu-based system.
For example
"retrieve dog cat bird" would be an example query.
You are welcome to add query term weights, e.g., retrieveWt 0.3 dog 0.4 cat 0.3 bird. Weights may be negative. This implies, for example, that documents on cats will score higher than documents on dogs or birds, other things being equal. If you add query term weights, there is a 15% bonus.
Hand In
Sample queries and the top ten document identifiers).
Test your system with the following queries:
Also, hand in listings of your code and a project report which outlines your algorithms, your data structures, and any files created. Discuss the complexity (roughly) of your algorithms. Include output with the queries listed above.
A sample session would look like:
$ cd /appropriate/directory
$ retrieve dog cat mouse
013.html 0.8
404.html 0.3
332.html 0.25
...
the last of the top ten files
If no files with non-zero scores exist, print an appropriate message. Don't bother printing files with scores of zero.
As usual, prepare a report with your code and the output listed above, and submit via Blackboard.